Reliability of an automated gaze‐controlled paradigm for capturing neural responses during visual and face processing in toddlerhood

Abstract Electroencephalography (EEG) has substantial potential value for examining individual differences during early development. Current challenges in developmental EEG research include high dropout rates and low trial numbers, which may in part be due to passive stimulus presentation. Comparability is challenged by idiosyncratic processing pipelines. We present a novel toolbox (“Braintools”) that uses gaze‐contingent stimulus presentation and an automated processing pipeline suitable for measuring visual processing through low‐density EEG recordings in the field. We tested the feasibility of this toolbox in 61 2.5‐ to 4‐year olds, and computed test–retest reliability (1‐ to 2‐week interval) of event‐related potentials (ERP) associated with visual (P1) and face processing (N290, P400). Feasibility was good, with 52 toddlers providing some EEG data at the first session. Reliability values for ERP features were moderate when derived from 20 trials; this would allow inclusion of 79% of the 61 toddlers for the P1 and 82% for the N290 and P400. P1 amplitude/latency were more reliable across sessions than for the N290 and P400. Amplitudes were generally more reliable than latencies. Automated and standardized solutions to collection and analysis of event‐related EEG data would allow efficient application in large‐scale global health studies, opening significant potential for examining individual differences in development.

efforts to collect neurocognitive data in early development remain rare, in part due to methodological challenges related to scalable measurement of brain functioning in infants and toddlers (Azhari et al., 2020;Brooker et al., 2019;Byers-Heinlein et al., 2021;Noreika et al., 2020).
One method that can provide temporally sensitive measures of brain functioning is electroencephalography (EEG). Ongoing EEG reflects the synchronized activity of large populations of neurons with a millisecond temporal resolution (Lopes da Silva, 2013). EEG methods are particularly suitable for measuring neural responses in awake infants and toddlers as there are few movement restrictions, in particular when using wireless EEG (Lau-Zhu et al., 2019). In event-related approaches, ongoing EEG is time-locked to stimulus presentation and averaged across multiple trials (Luck, 2014). The resulting sequence of peaks and troughs (event-related potential [ERP]) represents neural activity reliably elicited by a particular auditory, tactile, or visual stimulus.
Although neural responses to different stimulus categories have been widely studied in the lab, the published literature often disproportionately reflects small groups of infants or young children from high-income countries (HIC), who may not yield robust, generalizable insights into development and are not suitable for studying individual differences (Henrich et al., 2010). Moving key experimental techniques to field settings where data can be collected at scale is important to ensure greater sample diversity, increased sample size, and greater translation of methodological innovation to population studies. To do this, we must address several challenges. First, traditional ERP paradigms with infants and toddlers are often associated with high dropout rates with a range of 21%-50% (De Haan et al., 2002;Halit et al., 2003;Jones et al., 2016;Stets et al., 2012;van der Velde & Junge, 2020;Webb et al., 2011). Such high attrition rates lead to small sample sizes and constrain analyses of individual differences. For example, analytic methods whose statistical power is affected by cluster size (e.g., random effects models) are often utilized in developmental and global health studies with independent sampling of specific subpopulations (e.g., infants with elevated familial likelihood of autism, or learning disability). Reliable estimates of individual differences with such methods are maximized with appropriate numbers of participants (Austin & Leckie, 2018). In addition, high attrition rates could mean that included samples are biased toward more attentive children, reducing generalizability. Second, to maximize inclusion rates traditional analysis choices for infants and young children use thresholds of 10 trials for inclusion (vs. 20 or 30 in adult studies; Hämmerer et al., 2013;Huffmeijer et al., 2014). 
Although infant ERP components often show larger amplitudes compared to adults, it is unknown whether a lower number of trials compromises reliability of the measured neural responses. Third, there is low consistency in the selected parameters and manual steps taken during data analyses between different studies, labs, and even experimenters within labs (Noreika et al., 2020). Some automated EEG preprocessing pipelines have been developed, such as the HAPPE pipeline (Gabard-Durnam et al., 2018), the MARA pipeline (Winkler et al., 2014), the MADE pipeline, the EEG-IP-L pipeline (Desjardins et al., 2021), and the adjusted ADJUST algorithm. However, these have been developed for high-density EEG systems; although high-density EEG allows assessment of additional metrics like connectivity and source analysis, such systems are currently costly and few portable versions are suitable for use in the field. Large-scale studies incorporating assessment of single ERP features do not require high-density systems, and there is thus an additional need for pipelines that are tailored for low-density/low-cost EEG (Lau-Zhu et al., 2019).
The current project tests the feasibility of a fully automated low-cost approach to collect event-related EEG data for use in field settings.
The project comprises a toolbox, Braintools, that utilizes gaze-contingent stimulus presentation. Gaze-contingent stimulus presentation allows for data collection paced by the participants themselves, as stimuli are only presented when participants are looking at the screen. With this approach, data collection is tuned to the attentional resources of the participants and should theoretically reduce dropout rates and increase trial numbers and thus data availability. In addition, the Braintools toolbox includes scripts for automated harmonized analyses of the data that are not based on visual inspection by researchers (Conte et al., 2020). Braintools is designed for use with a low-cost, low-density, portable, and wearable EEG system to enable scalability, and thus automated pipelines do not rely on techniques like independent component analysis (ICA) or principal component analysis (PCA) that typically require high-density arrays (Winkler et al., 2014).
The Braintools paradigm has been implemented in studies in both HIC (United Kingdom) and low-income countries (LIC; India and The Gambia) to examine its potential for global health implementations.
The Braintools toolbox includes a range of visual and auditory tasks commonly used in developmental research. Here, we focus on the visual task as visual processing may be a suitable domain for examining individual differences during early development. The rapid development of early visual processing is partly experience dependent, and becomes faster and more efficient with increasing age (Geldart et al., 2002;Röder et al., 2013). The visual cortex develops rapidly during infancy and refines into childhood, supporting the development of visual acuity, contrast sensitivity, and binocularity (Braddick & Atkinson, 2011;Leat et al., 2009;van den Boomen et al., 2015). Visual stimuli can also be readily controlled in experimental designs. Together, these features of visual processing make the domain a good potential marker for examining individual developmental differences.
Event-related EEG designs can be used to study both early-stage components associated with domain-general visual cortical processing, and later-stage components associated with the development of domain-specific experience-dependent expertise. Early stage cortical visual processing is often indexed by the P1, a positive deflection around 100 ms after stimulus onset at occipital electrodes that is associated with low-level sensory processing of visual stimuli (Rossion & Caharel, 2011) and can be most strongly elicited by high-contrast and mid-spatial frequency stimuli such as checkerboards (Benedek et al., 2016). Later-stage components are more sensitive to more complex higher level processing, such as detecting and discriminating faces.
Young infants orient to faces from birth (Johnson et al., 1991), but face processing continues to develop into adolescence (Kilford et al., 2016).
Faces provide important communicative cues that are critical during social communication and interaction (Frith & Frith, 2007). Rapid and efficient face processing, and the ability to extract subtle information from faces is therefore key for social functioning. Event-related EEG designs are widely used to study face processing, particularly through measuring the face-sensitive N170 component at parietal electrodes in adults (Bentin et al., 1996). The N290 and the P400 components are thought to represent the infant and toddler precursor of the adult N170 (De Haan et al., 2002;Halit et al., 2003). The N290 is a negative deflection around 290 ms after stimulus onset, while the P400 is a positive deflection around 400 ms after stimulus onset. The N290 shows a faster latency and larger amplitude (more negative) response to face stimuli than nonface stimuli, like cars, objects, and houses (Kuefner et al., 2010). The P400 shows a smaller amplitude (less positive) for faces than nonface stimuli (Conte et al., 2020;Di Lorenzo et al., 2020;Jones et al., 2017). Furthermore, the N290 amplitude is larger for inverted than upright human faces, while the N290 latencies and P400 amplitudes are similar for both orientations in 12-month-old infants (De Haan et al., 2002). Toddlers, however, show a larger P400 amplitude for inverted compared to upright faces (Peykarjou et al., 2013), whereas the N290/N170 amplitude and latency show no modulations by orientation (Henderson et al., 2003;Peykarjou et al., 2013). Children and adolescents show longer N170 latencies for inverted than upright faces, while the inversion effect on N170 amplitude increases with age showing more negative amplitudes for inverted faces (Itier & Taylor, 2004b). The N290 and P400 have often been the measures of interest in studies examining developmental trajectories in young typically developing children and young children with developmental disorders (Bhavnani et al., 2021).
Although studies using EEG to examine categorical response differences in groups of infants and toddlers have provided clear insights into brain development, using EEG measures to assess individual differences brings additional challenges. Research into individual differences requires robust signals at an individual level that are stable and reliable over brief time windows. Such measures could help track early development of visual processing in relation to early adversity, for example, in large population studies or in the context of global health research.
Encouragingly, previous studies have found test-retest reliabilities in typically developing infants at values of 0.76 for N290 mean amplitude, and 0.56 for P400 mean amplitude in response to faces over a 1- to 2-week interval (Munsters et al., 2019), and in typically developing 6- to 12-year-old children at values of 0.80 for P1 peak amplitude in response to checkerboards and 0.77 for N170 peak amplitude in response to faces over a 6- to 8-week interval (Webb et al., 2020). However, similar values for toddlers are not available as this has traditionally been an age range in which data are very difficult to collect. Early event-related measures may have the potential to help identify young children who show atypical developmental trajectories and to predict later outcomes. For instance, individual variability in face processing in young children may predict later behavioral outcomes during toddlerhood and mid-childhood, such as autism (Elsabbagh et al., 2009;Shephard et al., 2020). To examine individual differences in visual processing across early development in a reliable and robust way, there is a need for novel methods that reduce dropout rates, are reliable, increase data availability, and standardize data analyses.
Here, we examined the feasibility and test-retest reliability of the Braintools toolbox in typically developing 2.5- to 4-year-old toddlers in the United Kingdom as a first step. We examined P1 components to checkerboards, and N290 and P400 components to faces (upright and inverted orientation). We focused on early childhood because this age range can be the most challenging to test (Brooker et al., 2019); a recent review noted there have been no reliability studies during toddlerhood (Brooker et al., 2019); and this is the age range in which many neurodevelopmental disorders become initially apparent in behavior, making it an important age range for large-scale studies of child development. Toddlers were tested twice over a 1- to 2-week interval with a low-density EEG system. We chose 1-2 weeks as the interval because shorter intervals might lead to repetition effects in the neural responses and data loss. Young children are less interested in stimuli that they have recently seen. This results in lower numbers of artifact-free trials and subsequently in smaller sample sizes for test-retest reliability analyses (Haartsen et al., 2020). In contrast, longer intervals may reflect developmental change rather than stability of the measures (Blasi et al., 2014;Haartsen et al., 2020). We estimated the reliability of measures of P1, N290, and P400 peak latency, peak amplitude, and mean amplitude. As previous research with both infants (Munsters et al., 2019) and adults (Huffmeijer et al., 2014) suggests peak latency measures are not robust, we additionally examined the utility of a method called "dynamic time warping" (DTW), which, rather than relying on the identification of individual peaks, warps an individual EEG signal until it matches a reference signal to identify the general delay between the two (Zoumpoulaki et al., 2015).
We computed intraclass correlations (ICCs) of the ERP measures across different numbers of included trials in order to make recommendations about the minimum trial numbers needed for stable estimates. We furthermore calculated ICCs within sessions or split-half reliability to examine internal stability at different numbers of trials.
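The dynamic time warping idea described above can be illustrated with a minimal sketch. It assumes an absolute-difference cost and summarizes the delay as the mean index offset along the warping path; this summary is a simplification chosen for illustration and is not claimed to be the measure defined by Zoumpoulaki et al. (2015). The function names `dtw_path` and `dtw_lag_ms` and the `sample_ms` parameter are hypothetical.

```python
def dtw_path(ref, sig):
    """Return the optimal warping path between two sequences as (i, j) pairs."""
    n, m = len(ref), len(sig)
    inf = float("inf")
    # acc[i][j]: minimal accumulated |ref - sig| cost aligning ref[:i+1] with sig[:j+1]
    acc = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            cost = abs(ref[i] - sig[j])
            if i == 0 and j == 0:
                acc[i][j] = cost
                continue
            best_prev = min(
                acc[i - 1][j - 1] if i > 0 and j > 0 else inf,
                acc[i - 1][j] if i > 0 else inf,
                acc[i][j - 1] if j > 0 else inf,
            )
            acc[i][j] = cost + best_prev
    # Backtrack from the end; the diagonal step is listed first so it wins ties.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while i > 0 or j > 0:
        steps = []
        if i > 0 and j > 0:
            steps.append((acc[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            steps.append((acc[i - 1][j], i - 1, j))
        if j > 0:
            steps.append((acc[i][j - 1], i, j - 1))
        _, i, j = min(steps, key=lambda s: s[0])
        path.append((i, j))
    return path[::-1]


def dtw_lag_ms(ref, sig, sample_ms):
    """Mean offset (sig index minus ref index) along the path, in ms.

    Positive values mean `sig` is delayed relative to `ref`.
    This summary statistic is an illustrative assumption.
    """
    path = dtw_path(ref, sig)
    return sample_ms * sum(j - i for i, j in path) / len(path)


reference = [0, 0, 1, 3, 1, 0, 0, 0, 0]
delayed = [0, 0, 0, 0, 1, 3, 1, 0, 0]  # same shape, two samples later
lag = dtw_lag_ms(reference, delayed, sample_ms=2)  # 2 ms per sample at 500 Hz
```

Because the whole waveform contributes to the alignment, a sketch like this does not depend on any single peak being identifiable, which is the motivation given in the text for considering DTW alongside peak latency.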

Participants
Sixty-one (34 female) typically developing full-term children were recruited when they were between 30 and 48 months old from the Greater London area via the Birkbeck Babylab database. Participants were invited to attend two visits to the lab with an interval of 1-2 weeks between visits. Parents/caregivers gave written informed consent upon arrival at the Babylab, after the study was explained to them by the researchers. They also filled out a demographic questionnaire, a medical questionnaire, and a language questionnaire.
The children in the whole recruited sample had an average age of 38.36 months (SD = 4.67, ranging from 30 to 49 months) on the day of their first visit. The second visit took place for 51 of these children, on average 10 days after the first visit (SD = 5, ranging from 7 to 28 days). Annual household income for the families was below £40,000 for 25% of the children in the recruited sample, between £40,000 and £99,999 for 44%, and above £100,000 for

FIGURE 1 The Braintools paradigm. Children participated in the Braintools study wearing an Enobio EEG system (a). While their EEG was recorded, they watched the FastERP task, in which checkerboards, faces, and animals were presented in blocks of trials (b

for channel layout). We selected a low-density array because for EEG with children with hair (i.e., not babies), the amount of time required to appropriately seat electrodes and ensure a good quality signal scales with the number of electrodes. Larger arrays can thus increase dropout rates for children with limited patience for cap placement, and can also be challenging to incorporate in large-scale studies where time is essential. Given our focus on selected ERP signals measured at known locations, we were able to utilize a low-density array focused on the key locations of interest: Oz on the occipital area to measure the P1 component (Benedek et al., 2016;Rossion & Caharel, 2011), and P7 and P8 over parietal areas to measure the N290 and P400 components (Conte et al., 2020;De Haan et al., 2002;Di Lorenzo et al., 2020;Halit et al., 2003;Jones et al., 2017).

Stimuli and apparatus
The CMS and DRL electrodes were attached to a clip on the participants' ear. For participants who refused to wear the ear clip, the CMS and DRL electrodes were placed on the mastoid with stickers (N = 8).
Data were recorded with a 500 Hz sampling rate using Neuroelectrics NIC 2.0 software (Barcelona, Spain). Data quality during the session was monitored using the Neuroelectrics Quality Index (QI) rather than an impedance check (an impedance check feature is not incorporated in the EEG system used here). The QI is calculated every 2 s from line noise (power in the 49-51 Hz range), main noise (power in the 1-40 Hz frequency range), and the offset of the signal. NIC software shows the color codes of the QI for each channel, with green for low QI and good data quality, orange for average data quality, and red for high QI and low data quality.
Each session was recorded with a webcam (HD Pro Webcam C920) and Open Broadcaster Software (OBS).

Procedure
Procedures were identical for both sessions and were performed by two researchers per session (EB, TDB). During the experiment, children were seated on their parent or caregiver's lap at approximately 60 cm from the screen. The EEG cap and ear clip were positioned on the children while they watched a cartoon video of their choice until all of the EEG signals were of sufficient quality. The researchers aimed to improve data quality until each of the QI values for the EEG channels was green (or most green and some orange, depending on the state and tolerance of the participant). Parents/caregivers were asked to wear a pair of plastic shutter glasses to ensure the eye-tracker picked up only the child's eye gaze. The experimenters proceeded with the Braintools battery after a successful five-point ET calibration, or if at least four of five points were successfully calibrated (defined per calibration point as accuracy and precision both less than 2.5° of visual angle for at least one eye). During this battery, the visual task was presented in blocks, and music was played throughout the session in order to keep the children engaged. Researchers further attempted to improve the signal quality when they noticed EEG quality dropped (by inspection of the EEG signal or red color codes).
Blocks and trials for the visual task were presented in a gaze-contingent fashion; trials were only presented when the children were looking at the screen and presentation paused when the children looked away. Researchers asked the toddlers to name or count the animals they were seeing out loud. If the children were not looking at the screen for a prolonged period of time, experimenters tried to re-engage the children's attention to the screen by playing a short auditory attention getter.

Assessment of visual attention
Gaze was extracted from the ET recording during the presentation of the stimulus for each trial (0-500 ms). We interpolated gaps with a duration of 150 ms or less and then calculated the number of valid trials for the FastERP task. A trial was considered valid if the proportional looking time during the stimulus presentation was 50% or above. We then calculated the percentage of valid ET trials relative to the total number of presented trials in the ET data as a measure of attentiveness during the whole session. Based on these data, we created a subsample of children who were very attentive (≥60% trials attended for both sessions) within which to additionally assess reliability.
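The trial validity rule above can be sketched as follows. This is a minimal illustration, assuming gaze is available as one boolean per eye-tracker sample (on screen or not) and that only gaps bounded by valid samples on both sides are interpolated; the eye-tracker sampling rate is not stated in the text, so `rate_hz` is a free parameter, and all function names are hypothetical.

```python
def fill_short_gaps(on_screen, max_gap_samples):
    """Fill runs of missing gaze no longer than max_gap_samples.

    Only interior gaps (valid samples on both sides) are filled; this
    boundary handling is an assumption, not taken from the text.
    """
    filled = list(on_screen)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i]:
            i += 1
            continue
        j = i
        while j < n and not filled[j]:
            j += 1  # j is the first valid sample after the gap (or n)
        if i > 0 and j < n and (j - i) <= max_gap_samples:
            for k in range(i, j):
                filled[k] = True
        i = j
    return filled


def trial_is_valid(on_screen, rate_hz, max_gap_ms=150, min_proportion=0.5):
    """Valid if proportional looking time after gap filling is at least 50%."""
    max_gap_samples = int(max_gap_ms * rate_hz / 1000)
    filled = fill_short_gaps(on_screen, max_gap_samples)
    return sum(filled) / len(filled) >= min_proportion


def percent_attended(trials, rate_hz):
    """Percentage of presented trials that count as valid ET trials."""
    valid = sum(trial_is_valid(t, rate_hz) for t in trials)
    return 100.0 * valid / len(trials)


# 500 ms of gaze at a hypothetical 100 Hz: a 100 ms interior gap, then look-away
trial_a = [True] * 10 + [False] * 10 + [True] * 10 + [False] * 20
trial_b = [True] * 10 + [False] * 40
```

In this example `trial_a` reaches 60% looking time only because the short interior gap is filled, whereas `trial_b` stays at 20% and is rejected.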

Data preprocessing
Data were preprocessed using a combination of in-house written Continuous data were segmented into trials from −100 to 600 ms after stimulus onset. Then trials were split into two datasets by condition (checkerboards or faces) in order to apply task-specific processing steps. For the checkerboard trials, signals for the occipital channel Oz, and C3, C4, and Cz (for later re-referencing) were filtered using a 0.1-40 Hz bandpass filter (with 3 s padding on both sides) to filter out high-frequency noise from muscle artifacts and a dft filter (at 50, 100, and 150 Hz) to filter out the still remaining residual line noise due to the shallower slope of the bandpass filter. Data were baseline corrected with −100 to 0 as baseline window. Trials were marked bad if the signal was flat (if amplitude did not exceed 0.0001 µV) or exceeded thresholds of −150 and 150 µV (selected based on initial inspection of data quality at different thresholds, see SM2). A channel was excluded if artifacts were present for 80% of the trials or above and a trial was excluded if the signal for Oz included artifacts. Next, we re-referenced the data on a trial-to-trial basis to Cz, or the average of C3 and C4 if Cz contained artifacts. If channels C3 and/or C4 contained artifacts as well, the whole trial was excluded from further analysis.
For the face trials (upright and inverted, all ethnicities), we selected parietal channels P7 and P8, and Cz, C3, and C4 for later rereferencing. Filters, baseline correction and artifact identification were identical to those used for the checkerboard trials. A channel was excluded if the signal was bad for 80% of the trials or above and a trial was excluded if the signal for P7 and/or P8 contained artifacts. Finally, the re-referencing procedure for face trials was identical to the procedure for the checkerboard trials.
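The per-trial steps described above (baseline correction, artifact flagging, and re-referencing with a fallback) can be sketched in a few lines. This is an illustrative sketch only: it interprets "flat" as the maximum absolute amplitude not exceeding 0.0001 µV, omits filtering and the 80% channel-exclusion rule, and uses hypothetical function names rather than anything from the Braintools code.

```python
def baseline_correct(times_ms, signal, start_ms=-100, end_ms=0):
    """Subtract the mean of the baseline window from every sample."""
    base = [v for t, v in zip(times_ms, signal) if start_ms <= t < end_ms]
    offset = sum(base) / len(base)
    return [v - offset for v in signal]


def has_artifact(signal, flat_uv=0.0001, limit_uv=150.0):
    """Flag a flat signal or one exceeding the +/-150 microvolt thresholds."""
    peak = max(abs(v) for v in signal)
    return peak <= flat_uv or peak > limit_uv


def rereference_trial(target, cz, c3, c4):
    """Re-reference one trial to Cz, or to the C3/C4 average if Cz is bad.

    Returns None when the trial should be excluded.
    """
    if has_artifact(target):
        return None
    if not has_artifact(cz):
        reference = cz
    elif not has_artifact(c3) and not has_artifact(c4):
        reference = [(a + b) / 2 for a, b in zip(c3, c4)]
    else:
        return None
    return [t - r for t, r in zip(target, reference)]


oz = [1.0, 2.0, 3.0]
clean_cz = [0.5, 0.5, 0.5]
noisy_cz = [0.5, 200.0, 0.5]
c3 = [1.0, 1.0, 1.0]
c4 = [3.0, 3.0, 3.0]
```

With `clean_cz` the trial is referenced to Cz; with `noisy_cz` the sketch falls back to the C3/C4 average, and it drops the trial if those channels are also bad.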
The animal trials were intended to maintain toddler attention in the paradigm. For completeness, we preprocessed these trials with the same parameters as the face trials. Further visual inspection of the grand averages of the animal trials indicated that the event-related responses elicited were weak (see SM3). These trials may not provide comparable neural responses because toddlers sometimes named or counted the animals, and because the icons chosen had lower visual complexity than the faces. We therefore excluded the animal trials from further analyses.

was identified within windows of 286-610 ms in 3- to 4-year-olds (Dawson et al., 2002), and 300-600 ms in 4-year-olds (Jones et al., 2018). We adjusted these time windows for our age and setting by examining grand averages of the components across all available data sessions with 10 artifact-free trials or more (note: we included both test and retest datasets and used a cutoff of 10 trials in accordance with previous ERP studies; Jones et al., 2018;Webb et al., 2011).

Extracting ERP features
P1 was measured at Oz, and N290/P400 at the average of P7 and P8. As an additional check for individual averages with limited signal content, we excluded individually averaged time series from the grand average where amplitudes did not exceed 1 or −1 µV. We then confirmed that the peaks in the grand averages were within the peaks from the previous studies. We adjusted the end of the N290 and start of the P400 window to decrease the overlap, and the end of the P400 window to match up with the offset of the stimulus. Final windows were: P1, most positive peak between 50 and 200 ms after stimulus onset; N290, most negative peak between 190 and 350 ms after stimulus onset; P400, between 300 and 500 ms after stimulus onset (see Figure 2).
To examine test-retest reliability across different numbers of trials, we randomly selected subsets of trials from the clean datasets for Peaks for the individually averaged ERPs were identified with a peak identification algorithm. First, all positive peaks were identified by testing at each EEG sample (sample X) if the amplitude was larger than the amplitude of following EEG sample (sample X + 1). If the following sample (sample X + 1) had the same amplitude as the current sample (sample X), the algorithm tested whether the sample after (sample X + 2) had a lower amplitude (negative deflection) and identified the following sample (sample X + 1) as a positive peak. If the following sample (sample X + 1) had a smaller amplitude than the next sample (sample X + 2), this sample was considered as part of a plateau and the algorithm skipped to the next sample. Second, all negative peaks were identified by testing at each EEG sample (sample X) if the amplitude was smaller than the amplitude of the following sample (sample X + 1). If the following sample (sample X + 1) had the same amplitude as the current sample (sample X), the algorithm tested whether the sample after (sample X + 2) had a higher amplitude (positive deflection) and identified the following sample (sample X + 1) as a negative peak. If the following sample (sample X + 1) had a larger amplitude than the next sample (sample X + 2), this sample was considered as part of a plateau and the algorithm skipped to the next sample. The result of this algorithm was a list of all positive and negative peaks identified throughout the ERP waveform.
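The peak identification step can be sketched as a single pass over the averaged waveform. This is one possible reading of the description above, not the authors' code: it additionally assumes that a peak must rise from the preceding sample (the text does not state this explicitly), and it marks the last sample of a plateau as the peak. The function name `find_peaks` and the `polarity` argument are hypothetical.

```python
def find_peaks(signal, polarity=1):
    """Return indices of local peaks; polarity=1 for positive, -1 for negative.

    A plateau of equal samples followed by a deflection in the opposite
    direction yields one peak at the last plateau sample.
    """
    x = [polarity * v for v in signal]  # negative peaks become maxima
    peaks = []
    n = len(x)
    i = 1
    while i < n - 1:
        if x[i] > x[i - 1]:
            j = i
            while j + 1 < n and x[j + 1] == x[j]:
                j += 1  # walk along a plateau of identical amplitudes
            if j + 1 < n and x[j + 1] < x[j]:
                peaks.append(j)
            i = j + 1
        else:
            i += 1
    return peaks


waveform = [0, 1, 3, 1, 0, -2, -1, 2, 2, 1]
positive_peaks = find_peaks(waveform, polarity=1)
negative_peaks = find_peaks(waveform, polarity=-1)
```

For this toy waveform the positive peaks fall at the sample with amplitude 3 and at the end of the two-sample plateau, and the only negative peak is the trough at −2.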
From this list of peaks, the peaks of interest were selected. For the P1 peak, all positive peaks were selected occurring in our time window of interest (50-200 ms). We selected the identified peak as P1 peak if there was only one peak within the window. If multiple positive peaks were present during the time window, we selected the peak with the largest amplitude as the P1 peak. If no positive peaks were present in the time window, we widened our window by 20 ms on either side (30-220 ms) and selected the only peak or largest peak if multiple peaks were present. In the instances where no peaks were found in the wide window either, we noted down that no P1 peak was identified and the P1 was considered invalid. These waveforms were excluded from further analyses (to ensure sample sizes for reliability analyses were consistent across P1 features).
For the N290 peak, we selected all negative peaks occurring during our time window of interest (190-350 ms). As for the P1 peak selection, we selected the negative peak as N290 peak if only one peak was present during the time window. Otherwise, we selected the negative peak with the largest amplitude. If no negative peak was present, we widened our window by 20 ms on either side (170-370 ms). The only peak or peak with largest amplitude was selected as N290 peak. If no negative peaks were identified in this process, we considered this N290 as invalid and the waveform was excluded from further analyses.
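The window-based selection for P1 and N290 follows the same pattern and can be sketched with one helper. This is an illustrative sketch that takes a precomputed list of peak indices as input; the function name `select_peak` and its arguments are hypothetical, and "largest amplitude" is read as most positive for P1 and most negative for N290.

```python
def select_peak(times_ms, signal, peak_indices, window, polarity=1, widen_ms=20):
    """Pick the largest peak in the window, widening once if none is found.

    Returns the index of the selected peak, or None if the component is
    considered invalid.
    """
    start_ms, end_ms = window
    for margin in (0, widen_ms):
        candidates = [
            i for i in peak_indices
            if start_ms - margin <= times_ms[i] <= end_ms + margin
        ]
        if candidates:
            return max(candidates, key=lambda i: polarity * signal[i])
    return None


times = list(range(0, 400, 10))  # hypothetical 10 ms steps, so time = 10 * index
erp = [0.0] * len(times)
erp[4], erp[10], erp[15] = 3.0, 5.0, 8.0     # positive peaks at 40, 100, 150 ms
erp[18], erp[30] = -2.0, -6.0                # negative peaks at 180, 300 ms

p1 = select_peak(times, erp, [4, 10, 15], window=(50, 200), polarity=1)
p1_widened = select_peak(times, erp, [4], window=(50, 200), polarity=1)
n290 = select_peak(times, erp, [18, 30], window=(190, 350), polarity=-1)
n290_widened = select_peak(times, erp, [18], window=(190, 350), polarity=-1)
```

The widened windows here are 30-220 ms for P1 and 170-370 ms for N290, matching the values given in the text.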
For checkerboards, we extracted P1 peak latency and peak amplitude (averaged across a 60-ms window centered on the P1 peak identified by the algorithm described above, if valid). For faces, we extracted N290 peak latency, peak amplitude (average amplitude across a 60-ms window centered on the N290 peak identified by the algorithm described above, if valid), and mean amplitude (190-350 ms). We also calculated P400 mean amplitude (300-500 ms); we did not compute peak latency or amplitude because the P400 has a wider peak morphology and its peak is consequently more difficult to identify. Identified peaks were considered likely to represent "noise" and removed from further analysis if the amplitude at the ERP peak latency did not exceed the amplitude of the largest peak of the same directionality (positive or negative) during the baseline. Note that amplitudes for this check were taken at the point of the peak instead of averaged across a 60-ms window.
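The amplitude features and the baseline noise check can be sketched as below. This is a minimal illustration with hypothetical function names; window edges are treated as inclusive, and a waveform with no baseline peak of the same direction is assumed to pass the noise check, neither of which is specified in the text.

```python
def window_mean(times_ms, signal, start_ms, end_ms):
    """Mean amplitude over all samples with start_ms <= t <= end_ms."""
    values = [v for t, v in zip(times_ms, signal) if start_ms <= t <= end_ms]
    return sum(values) / len(values)


def peak_amplitude(times_ms, signal, peak_index, width_ms=60):
    """Average amplitude in a window of width_ms centered on the peak."""
    centre = times_ms[peak_index]
    half = width_ms / 2
    return window_mean(times_ms, signal, centre - half, centre + half)


def exceeds_baseline(signal, peak_index, baseline_peak_indices, polarity=1):
    """Noise check using point amplitudes rather than window averages."""
    if not baseline_peak_indices:
        return True  # assumption: nothing in the baseline to compare against
    largest = max(polarity * signal[i] for i in baseline_peak_indices)
    return polarity * signal[peak_index] > largest


times = list(range(-100, 300, 10))   # hypothetical 10 ms steps; index 20 is 100 ms
erp = [0.0] * len(times)
erp[19], erp[20], erp[21] = 2.0, 4.0, 2.0   # component peaking at 100 ms
erp[5] = 1.0                                 # small positive peak at -50 ms

amp = peak_amplitude(times, erp, 20)         # mean over 70-130 ms
mean_amp = window_mean(times, erp, 50, 200)
keep = exceeds_baseline(erp, 20, [5])
```

Averaging across the window makes the amplitude estimate less sensitive to single-sample noise than the point value used for the baseline comparison.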

Statistical analyses
To assess the feasibility of the paradigm, we calculated the percentage of included participants for analyses focused on visual processing and face processing relative to both the full recruited sample; and relative to the sample who had data from at least one EEG session. We report both percentages for completeness to reflect retention rates when including and excluding participants whose sessions had technical issues or where participants refused to wear the EEG cap.
Next, we tested the face processing condition effects in our sample of toddlers with a first EEG session. We performed a paired t-test comparing ERP features between upright faces and inverted faces (N290 peak latency, peak amplitude, and mean amplitude, P400 mean amplitude, and DTW direction during the N290 time window). We included toddlers with 20 or more artifact-free trials for each condition.
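The paired comparison between upright and inverted faces reduces to a one-sample test on within-child differences. The sketch below computes only the t statistic and its degrees of freedom with the standard library; the p value would need a t distribution from a statistics package. The function name and the example amplitudes are hypothetical and are not data from the study.

```python
import math


def paired_t(condition_a, condition_b):
    """Return (t, df) for a paired t-test on two equal-length lists."""
    diffs = [a - b for a, b in zip(condition_a, condition_b)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator)
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    t = mean_diff / math.sqrt(var_diff / n)
    return t, n - 1


# Hypothetical mean amplitudes (microvolts) for three children
upright = [2.0, 3.0, 4.0]
inverted = [3.0, 5.0, 7.0]
t_value, df = paired_t(upright, inverted)
```

Because the degrees of freedom equal the number of pairs minus one, a reported statistic of the form t(46) corresponds to 47 children contributing data in both conditions.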
We assessed test-retest reliability of the ERP features using the ICC between measures at visit 1 and visit 2, as in Haartsen et al. (2020) and van der Velde et al. (2019). We used the ICC(3,1), which is a two-way fixed model ICC suitable for measuring the consistency between single scores (Salarian, 2016;Shrout & Fleiss, 1979;Weir, 2005). The ICC is calculated as: In this formula, MS R is the variance between objects (here partici-
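For reference, the consistency form of ICC(3,1) in Shrout and Fleiss (1979) is usually written as ICC(3,1) = (MS_R − MS_E) / (MS_R + (k − 1) MS_E), where MS_R is the between-participant mean square, MS_E the residual mean square, and k the number of sessions (here 2). The sketch below assumes this standard form and a complete participants-by-sessions table; the function name `icc_3_1` is hypothetical and this is not the toolbox implementation.

```python
def icc_3_1(scores):
    """ICC(3,1), consistency of single scores, for a list of rows.

    Each row holds one participant's scores across the k sessions.
    """
    n = len(scores)
    k = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Between-participant mean square
    ms_r = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    # Residual mean square after removing participant and session effects
    ss_e = sum(
        (scores[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    ms_e = ss_e / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e)


# Visit 2 equals visit 1 plus a constant: perfectly consistent ranking
shifted = [[1.0, 6.0], [2.0, 7.0], [3.0, 8.0], [4.0, 9.0]]

# Example ratings table from Shrout and Fleiss (1979): six targets, four judges
shrout_fleiss = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
```

A constant shift between visits leaves this coefficient at 1, which is why the consistency form suits test-retest questions where a session-level offset is not of interest.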

Feasibility of the paradigm
In total, 61 toddlers were invited to take part in the study consisting of two sessions (hereafter named the recruited sample, also see flowchart in Figure 3). Fifty-two toddlers provided data for their first session without technical issues (e.g., computer issues; in this case, the

For the checkerboards, 84% of the toddlers in the recruited sample and 98% of the feasibility sample had at least one trial with clean EEG data (see Table 1). Of the recruited sample, 80% had a minimum of 10 artifact-free trials, and 79% and 75% had a minimum of 20 and 30 trials, respectively.
The highest dropout rate was for a cutoff of 50 trials where only 64% of the sample would be included. The inclusion percentages for the feasibility sample were 94%, 92%, 88%, and 75% with a cutoff of 10, 20, 30, and 50 trials, respectively.
For the faces, 84% of the recruited sample had at least 1 or 10 artifact-free EEG trials. Overall, 82% of the recruited toddlers would be included with a cutoff of 20 or 30 trials. At a cutoff of 100 trials, 62% of the recruited sample would be included in any main analyses. The inclusion percentages for the feasibility sample were 98%, 96%, and 73% with a cutoff of 1 or 10, 20 or 30, and 100 trials, respectively.
For the inversion effect in faces, 84% and 98% of the recruited and feasibility sample, respectively, had more than one trial for both face conditions, and 82% and 96% of the samples could be included with a cutoff of 10 trials. For higher cutoffs of 20, 30, and 40 trials, the inclusion percentages for the recruited sample were 77%, 72%, and 66%, respectively, and for the feasibility sample these were 90%, 85%, and 77%.
In our feasibility sample, the face inversion revealed a main effect for P400 mean amplitude (t(46) = −2.91, p = .006), where the P400 amplitude was higher for inverted faces (mean = 4.46 µV, SD = 5.13) than upright faces (mean = 2.92 µV, SD = 3.86). No face inversion effects were observed for the other ERP features (range: .100 ≤ p's ≤ .282).

Reliability of ERP measures
Demographics for the recruited, feasibility, included, and highly attentive samples are displayed in Table 2. In the included sample, 25 of 38 children heard English at least 95% of the time at both home and nursery, and two children had speech delays requiring speech therapy.
In the highly attentive sample, 15 of 23 children heard English at least 95% of the time at both home and nursery, and one child had speech delays requiring speech therapy.

TABLE 2 Descriptive data for (i) the whole recruited sample, (ii) the feasibility sample, (iii) the included sample, and (iv) the highly attentive sample
Note: Mean (SD), minimum-maximum. a Data missing for 10 participants in the whole sample. b Data missing for four participants in the whole sample, two participants in the included sample, and one participant in the highly attentive sample.

Test-retest reliability of ERPs during low-level visual processing
The ICC values for measures during low-level visual processing of the checkerboards for the included and highly attentive sample are displayed in Table 3. In the included sample, ICC values were within the poor or fair range for P1 measures. Values were fair for P1 peak latency and peak amplitude across 20 and 30 trials (ICC latency = .51 and .41 for 20 and 30 trials, and ICC peak amplitude = .46 and .44 for 20 and 30 trials), and even good for peak amplitude across 50 trials (ICC = .62).
Values for DTW during the P1 time window were good for 30 trials (ICC = .64). The pattern of ICC values in the highly attentive sample was very similar to the pattern in the included sample.
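For readers who want to compute ICC values of this kind, the sketch below implements a two-way random-effects, absolute-agreement, single-measure ICC, often written ICC(2,1), from a participants × sessions matrix. The choice of ICC form is an assumption, because this section does not state which form was used. The verbal labels use commonly cited thresholds (below .40 poor, .40 to .59 fair, .60 to .74 good, .75 and above excellent), which are likewise assumed here because they match the labels attached to the values above. Function names are hypothetical.

```python
import numpy as np


def icc_2_1(scores):
    """ICC(2,1) for an (n_participants, k_sessions) array of one ERP feature."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()   # between participants
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()   # between sessions
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols  # residual
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def icc_label(icc):
    """Verbal label under the assumed thresholds (.40, .60, .75)."""
    if icc < 0.40:
        return "poor"
    if icc < 0.60:
        return "fair"
    if icc < 0.75:
        return "good"
    return "excellent"


# Textbook 6 x 4 example from Shrout and Fleiss (1979), not study data;
# ICC(2,1) for this matrix is approximately .29.
example = [[9, 2, 5, 8], [6, 1, 3, 2], [8, 4, 6, 8],
           [7, 1, 2, 6], [10, 5, 6, 9], [6, 2, 4, 7]]
example_icc = icc_2_1(example)
```

For test-retest data the matrix would have two columns (session 1 and session 2), with one row per child who has enough clean trials at both sessions.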

3.2.2
Test-retest reliability of ERPs during face processing
The ICC values for the measures during the processing of upright and inverted faces are presented in Table 4. For the highly attentive sample, ICC values for N290 peak latency were mostly within the low range (.10 ≤ ICCs ≤ .48), similar to the included sample. Reliability for the N290 peak amplitude and mean amplitude was more likely to be in the poor reliability ranges com-

3.2.3
Test-retest reliability of face inversion effects
Table 5 displays the ICC values for the face inversion effects between the two visits across different numbers of trials included in the ERPs in the included sample and the highly attentive sample (see Table S1 for significance and direction of the face inversion effects). ICC values for the condition effects in the reliability sample were overall within the poor reliability range. For N290 peak latency and peak amplitude, ICC values ranged from −.24 to .32 and only the latter reached significance (for peak amplitude across 10 trials). ICCs for mean amplitude measures were also within the poor range (−.33 ≤ ICCs ≤ .36), with the exceptions of the measures across 60 trials, with ICC = .55 for the N290 mean amplitude and ICC = .49 for the P400 mean amplitude in the fair range.

TABLE 3 ICC values for EEG key metrics during low-level visual processing

Internal consistency of ERP measures
The test-retest reliability analyses (conducted across sessions 1-2 weeks apart) revealed ICC values that were mostly modest. We further examined the internal consistency of the ERP measures within each session by randomly drawing different numbers of clean trials from the datasets and splitting alternating trials into two datasets (A and B) for both the test and retest sessions separately. We then extracted each ERP component measure and calculated the internal consistency using the ICC between datasets A and B within sessions.
The results for EEG key metrics during low-level visual processing of checkerboards are presented in Table S2, and during face processing (faces upright and inverted collapsed) in Table S3.
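The split-half procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: the function name is hypothetical, trials are assumed to be stored as a (trials × samples) array for one child and one session, and the randomly drawn trials are assumed to be kept in their original order before alternating trials are assigned to datasets A and B.

```python
import numpy as np


def split_half_erps(trials, n_trials, rng):
    """Draw n_trials clean trials at random and average alternating trials.

    trials: array of shape (n_clean_trials, n_samples) for one session.
    Returns the averaged ERPs for dataset A (1st, 3rd, ...) and B (2nd, 4th, ...).
    """
    trials = np.asarray(trials, dtype=float)
    if n_trials > trials.shape[0]:
        raise ValueError("not enough clean trials for this cutoff")
    # Random draw without replacement, then restore chronological order (assumption).
    idx = np.sort(rng.choice(trials.shape[0], size=n_trials, replace=False))
    drawn = trials[idx]
    return drawn[0::2].mean(axis=0), drawn[1::2].mean(axis=0)


# Toy data: 40 "trials" of 5 samples, where trial i is constant at value i.
rng = np.random.default_rng(0)
trials = np.arange(40, dtype=float)[:, None] * np.ones((1, 5))
erp_a, erp_b = split_half_erps(trials, 40, rng)
```

Each half would then be reduced to the ERP feature of interest (for example, P1 peak amplitude), and the A and B values across children would enter the ICC for that session.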

DISCUSSION
This study set out to test the feasibility and reliability of an automated toolbox that included a gaze-contingent stimulus presentation paradigm and an automated low-density EEG processing pipeline for measuring event-related responses to visual stimuli in toddlers.

Feasibility
We invited 61 children to take part in the study. Seven children (11%) refused to wear the EEG cap at the first session. This percentage is consistent with other reports where 11% of 5-year-old participants refused to wear the cap (Brooker et al., 2019). During the sessions, acceptance of the cap varied between toddlers. Some toddlers required more explanation and demonstration before agreeing to wear the cap, for example, parents or researchers demonstrating how to wear the cap themselves. It may be helpful to supply families with a child-friendly video or storyboard demonstrating the capping procedure that they could watch prior to the visit. In addition, using age-appropriate language and comparisons to everyday objects and actions may help children accept the cap more readily, for example: "the EEG gel is similar to daddy's hair gel," "the EEG cap is like a hat," and "you can do magic with your eyes and make the next picture appear on the screen." It is good practice in developmental EEG research to take variability in EEG cap acceptance into account when designing and setting up an ERP study in preschoolers, particularly by anticipating data loss due to cap refusal and by allowing flexibility in explaining the paradigm to the participants themselves.
Our reliability findings for visual processing of checkerboards suggest moderate reliability for the P1 component features when extracted from an ERP averaged across 20 trials or more. Based on this cutoff, we would be able to include 79% of the recruited participants in analyses on P1 peak latency and peak amplitude. We would be able to include 75% of the participants for the DTW measure in the P1 time window, using a cutoff of 30 trials with good reliability values. For face processing, a cutoff of 20 or 30 trials would result in 82% of the participants in the recruited sample being included in analyses.
Taken together, these results suggest we can include 75%-82% of the recruited sample (or 88%-92% of the children with data at the first session without technical issues) depending on our research interests.
These inclusion rates of 75%-82% from our study are higher than in previous research in toddlers (e.g., 50% in typically developing 18- to 30-month olds; Webb et al., 2011). These previous studies used a cutoff of 10 trials and did not focus on test-retest reliability, however.
Inclusion rates in these studies might have been even lower had they focused on reliability. Overall, this indicates that our gaze-controlled stimulus presentation paradigm with simultaneous low-density, low-cost EEG recording was successful in reducing the dropout rate in toddlers.
In addition to dropout rates, we also examined the condition effects of face orientation during the face processing task. Our 2.5- to 4-year olds did not display any differences in N290 ERP features between inverted and upright faces during early processing. These findings are consistent with studies reporting a lack of face inversion effects for the N170/N290 latency and amplitude in response to photographs of adult faces in 3-year olds (Peykarjou et al., 2013) and schematic faces in 4-year olds (Henderson et al., 2003). During later face processing, the P400 mean amplitude was larger for inverted than upright faces in the current study. A similar pattern was found in 3-year olds (Peykarjou et al., 2013). The reliability of the face inversion effect for P400 mean amplitude reached modest values at 60 trials or more, suggesting that 43% of the recruited sample could be included if one wants to examine individual differences in face inversion effects in this age group. It is possible that the pattern of face inversion effects for the N290 and P400 measures observed at this age is related to the pattern of reliability for these measures, as discussed in the next section.

Reliability
The results of the test-retest reliability analyses revealed that reliability of the ERP components was overall moderate and was gener-  (Hämmerer et al., 2013). Another possibility may be less stability in neural processing in younger age groups. Processes may stabilize more when they get more mature and intraindividual variability may decrease (Hämmerer et al., 2013).
Indeed, our internal consistency analyses revealed higher ICC values for P1 measures than for N290 and P400 measures, suggesting a higher signal-to-noise ratio and possibly more mature early-stage perceptual processing compared to later-stage face processing.
A range of factors may explain the differences in ICC values between measures and components in the current findings. First, the P1 peak is often more clearly detectable in individual ERPs than the N290 peak, as P1 peaks are sharper peaks with larger amplitudes while N290 peaks are often wider with smaller amplitudes and may not be detectable in some individuals (Munsters et al., 2019). If selecting a measure to examine individual differences in the speed of early neural responses, P1 latency may be more reliable than the N290 latency, and automated measures like DTW could provide a robust means of latency comparison. Second, mean amplitudes may be more reliable than peak amplitudes, which in turn may be more reliable than peak latencies. This is because noise overlaid on the signal will tend to be maximal at the peak, so averaging across a time window will average out noise, making mean amplitudes more reliable than peak amplitudes (Clayson et al., 2013). Third, differences in developmental stages may result in differences in reliability across neural responses during low-level sensory processing and face processing. Low-level sensory processing develops at a rapid pace during early postnatal development; the functional brain networks for visual processing show similar topographies to adult networks in neonates, whereas networks for higher order processing show similar topographies to adults during later postnatal periods at 1 or 2 years of age (Grayson & Fair, 2017; Keunen et al., 2017). Face processing in contrast has a protracted developmental trajectory that continues until the end of adolescence (Cohen Kadosh et al., 2013; Kilford et al., 2016).
It is possible that individual differences in measures of low-level visual processing are more stable compared to measures of face processing across the 2-week interval during toddlerhood, as is also supported by our findings of higher internal consistency for low-level visual processing measures than face processing measures within sessions.
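The contrast between peak and mean amplitude measures can be illustrated with a toy example. The sketch below extracts peak latency, peak amplitude, and mean amplitude from a time window, then adds simulated noise to a synthetic waveform to compare how much the two amplitude measures fluctuate. Everything here is an assumption made for illustration: the function name, the Gaussian-shaped component, the window, the sampling rate, and the use of white noise (real EEG noise is autocorrelated). It is not the study's extraction code.

```python
import numpy as np


def erp_features(erp, times, window, polarity=1):
    """Peak latency, peak amplitude, and mean amplitude within `window`.

    polarity=1 looks for a positive peak (e.g., P1); polarity=-1 for a
    negative peak (e.g., N290).
    """
    mask = (times >= window[0]) & (times <= window[1])
    seg, seg_t = erp[mask], times[mask]
    i = int(np.argmax(polarity * seg))
    return {
        "peak_latency": float(seg_t[i]),
        "peak_amplitude": float(seg[i]),
        "mean_amplitude": float(seg.mean()),
    }


# Synthetic positive component peaking at 100 ms (hypothetical 500 Hz sampling).
times = np.arange(0.0, 300.0, 2.0)
clean = 5.0 * np.exp(-0.5 * ((times - 100.0) / 20.0) ** 2)
window = (60.0, 140.0)
clean_features = erp_features(clean, times, window)

# Toy simulation: the same waveform with independent white noise, 500 times.
rng = np.random.default_rng(1)
noisy = [erp_features(clean + rng.normal(0.0, 1.0, times.size), times, window)
         for _ in range(500)]
sd_peak = np.std([f["peak_amplitude"] for f in noisy])
sd_mean = np.std([f["mean_amplitude"] for f in noisy])
```

In this simulation the mean amplitude varies less across noisy repetitions than the peak amplitude, which is the averaging argument made above. It says nothing about true between-child differences, which also determine reliability.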
Reliability values for DTW direction were overall lower than the values for the traditional ERP measures, for example, peak latency, peak amplitude, and mean amplitude. We included DTW in our measures because it has been proposed as a more robust measure of neural processing speed than peak latency (Zoumpoulaki et al., 2015). Here, we found that this measure might be more reliable for the P1 component during visual processing where there is a clearly detectable peak, but not for the N290 component where there is a less clearly detectable peak. Thus, DTW direction is less dependent on peak identification but requires a strong waveform morphology to function as a reliable and robust measure.
Our findings further suggested that sustained attention throughout the session did not affect reliability of low-level processing measures, because reliability values for P1 measures were similar across the included and highly attentive samples. In contrast, reliability values for the N290 measures were lower in the highly attentive sample than in the included sample, whereas values for the P400 measures were similar or even higher in the highly attentive sample. The overall lack of differences between the included and highly attentive samples indicates that reliability does not generally worsen when less attentive participants are included.
The characteristics of the FastERP task may further contribute to the reliability observed. The images of the checkerboard and four female faces were repeated throughout the paradigm. One possibility is that N290 features may be more reliable across trials with identical face stimuli. In the current analyses, we averaged across all artifact-free face trials; thus, the number of trials included for each of the four female faces likely varies across analyses. Future work could examine whether responses may be more stable when only one face stimulus is included in the paradigm. Alternatively, one could argue that identical face stimuli may lead to habituation effects that may affect N290 and P400 responses (Itier & Taylor, 2004a; Jacques et al., 2007; Nordt et al., 2016; Schweinberger & Neumann, 2016); the reliability of N290 responses evoked from trial-unique faces (Jones et al., 2016) is another option that could be explored in future work.
Analyses of the data have shown that the proposed gaze-controlled paradigm provides moderately stable estimates of event-related neural responses to checkerboards and faces during early development, comparable to those previously reported for infants (Munsters et al., 2019).
As previously mentioned, use of videos or storyboards prior to the visit may improve acceptability of wearing the EEG cap among the young participants. We noticed in the lab that toddlers were able to complete the sessions due to the flexibility of the paradigm that allowed them to take brief, frequent and self-determined breaks without the loss of data. The use of the wireless and mobile EEG system further facilitated this as toddlers could take a longer break away from the screen or even the room if needed and data collection would be paused (although this was rarely the case). Furthermore, real-time analysis of the EEG data during the session may help ensure good data quality and prevent dropout due to artifacts. These suggested improvements may enable even greater flexibility for the participants and lower dropout rates in future developmental studies.
This study has important implications for the developmental field.
First, the moderate reliability values suggest the gaze-controlled paradigm and low-density EEG processing pipeline may be suitable for developmental research. Second, the current findings are promising as low-cost and low-density EEG systems are more scalable for use in the clinic and field (Lau-Zhu et al., 2019). Further research will be needed to establish the suitability of the toolbox in LMIC populations and other age groups, for example, in toddlers in India or infants in The Gambia (http://braintools.bbk.ac.uk/), or children with neurodevelopmental disorders.

Limitations
We note that our paradigm has moderate test-retest reliability. Future work needs to explore whether other EEG features or other paradigms could achieve higher levels of reliability. Furthermore, this paradigm was designed for low-density EEG systems. The advantages of these low-density systems are their relatively low cost, scalability, portability, and potential for use outside the lab environment (Lau-Zhu et al., 2019).
Lab-based EEG studies in developmental research have commonly used high-density (HD) EEG systems with 32, 64, or 128 channels (e.g., in Munsters et al., 2019; Webb et al., 2020). Recording from a large number of EEG channels allows additional analyses such as connectivity (Bullmore & Sporns, 2009; van Wijk et al., 2010) or source localization of the brain signals measured in developmental studies (Johnson et al., 2001). Future work may examine whether other recording systems yield signals with higher reliability. Future development of low-cost, scalable HD systems will enable these measures to be brought into global EEG research.

CONCLUSION
In summary, we developed a novel toolbox with gaze-controlled stimulus presentation and an automated preprocessing pipeline suitable for low-density EEG systems that can be applied in large-scale samples in field settings. We showed that the toolbox is feasible for use in visual processing research in toddlers, with inclusion rates of 79% for low-level visual processing and 82% for face processing domains. Relevant to measures of individual differences, test-retest reliability over a 1- to 2-week interval was moderate for a minimum of 20 and 30 trials for low-level visual and face processing, respectively. Test-retest reliability and internal consistency of latency measures were higher for low-level visual processing compared to face processing, whereas reliability and internal consistency for amplitude measures were similar or better during face processing compared to low-level visual processing.
This suggests the speed and amplitude of low-level visual processing and amplitude measures during face processing are relatively more stable over time, and thus may be more suitable measures of individual differences in visual perception/cognition in toddlerhood. The feasibility of automated and standardized solutions for data collection and analyses with low-density EEG systems holds promise for large-scale studies and application in global health.

DATA AVAILABILITY STATEMENT
Scripts used for the gaze-controlled stimulus presentation can be accessed through contacts on the Task Engine website. The scripts used for the EEG preprocessing and reliability analyses are available on GitHub. Data analyzed in the current study are available upon request via the BOND lab website (Del Bianco et al., 2019).